{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Isi Penting Generator: News Style"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate a long text output with the style of a news article when given important facts (isi penting in Malay)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/isi-penting-generator-news-style](https://github.com/huseinzol05/Malaya/tree/master/example/isi-penting-generator-news-style).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "\n",
    "The results you see here are generated using stochastic methods. Learn more about the stochastic process on <a href=\"https://en.wikipedia.org/wiki/Stochastic_process\" target=\"_blank\">Wikipedia</a>\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 3.07 s, sys: 445 ms, total: 3.51 s\n",
      "Wall time: 3.31 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/maguswyvern/PythonVenvs/dev-malaya/lib/python3.10/site-packages/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import malaya\n",
    "from pprint import pprint"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List all available HuggingFace transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `malaya` library has a built in function to find all available transformers for this task. As of writing we have two transformers which are:\n",
    "\n",
    "1. mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased <br>\n",
    "https://huggingface.co/mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased\n",
    "   \n",
    "2. mesolitica/finetune-isi-penting-generator-t5-small-standard-bahasa-cased <br>\n",
    "https://huggingface.co/mesolitica/finetune-isi-penting-generator-t5-small-standard-bahasa-cased"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'mesolitica/finetune-isi-penting-generator-t5-small-standard-bahasa-cased': {'Size (MB)': 242,\n",
       "  'ROUGE-1': 0.24620333,\n",
       "  'ROUGE-2': 0.05896076,\n",
       "  'ROUGE-L': 0.15158954,\n",
       "  'Suggested length': 1024},\n",
       " 'mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased': {'Size (MB)': 892,\n",
       "  'ROUGE-1': 0.24620333,\n",
       "  'ROUGE-2': 0.05896076,\n",
       "  'ROUGE-L': 0.15158954,\n",
       "  'Suggested length': 1024}}"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.generator.isi_penting.available_huggingface"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load HuggingFace\n",
    "\n",
    "The Generator transformer in `malaya` is quite unique, most of the text generative model we found on the internet like GPT2 or Markov simply just continue the prefix input from user, but not for our Generator transformer. \n",
    "\n",
    "We want to generate an article or karangan like high school when the users give 'isi penting' or important facts for the article."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def huggingface(\n",
    "    model: str = 'mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased',\n",
    "    force_check: bool = True,\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    Load HuggingFace model to generate text based on isi penting.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model: str, optional (default='mesolitica/finetune-isi-penting-generator-t5-base-standard-bahasa-cased')\n",
    "        Check available models at `malaya.generator.isi_penting.available_huggingface`.\n",
    "    force_check: bool, optional (default=True)\n",
    "        Force check model one of malaya model.\n",
    "        Set to False if you have your own huggingface model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: malaya.torch_model.huggingface.IsiPentingGenerator\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n"
     ]
    }
   ],
   "source": [
    "model = malaya.generator.isi_penting.huggingface()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the `generate` function and the parameters it expects. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def generate(\n",
    "    self,\n",
    "    strings: List[str],\n",
    "    mode: str = 'surat-khabar',\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    generate a long text given a isi penting.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings : List[str]\n",
    "    mode: str, optional (default='surat-khabar')\n",
    "        Mode supported. Allowed values:\n",
    "\n",
    "        * ``'surat-khabar'`` - news style writing.\n",
    "        * ``'tajuk-surat-khabar'`` - headline news style writing.\n",
    "        * ``'artikel'`` - article style writing.\n",
    "        * ``'penerangan-produk'`` - product description style writing.\n",
    "        * ``'karangan'`` - karangan sekolah style writing.\n",
    "\n",
    "    **kwargs: vector arguments pass to huggingface `generate` method.\n",
    "        Read more at https://huggingface.co/docs/transformers/main_classes/text_generation\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Benefits of HuggingFace\n",
    "\n",
    "With the `generate` method you can use Greedy, Beam, Sampling, Nucleus decoder and so much more, read more about it on the [HuggingFace Article on How to Generate](https://huggingface.co/blog/how-to-generate). And recently, HuggingFace also released a new article [Introducing Csearch](https://huggingface.co/blog/introducing-csearch)\n",
    "\n",
    "Let's give a few lines of important facts or isi penting for the model to use to generate text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "isi_penting = ['Dr M perlu dikekalkan sebagai perdana menteri',\n",
    "              'Muhyiddin perlulah menolong Dr M',\n",
    "              'rakyat perlu menolong Muhyiddin']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['KUALA LUMPUR: Presiden Parti Pribumi Bersatu Malaysia (Bersatu), Tan Sri '\n",
      " 'Muhyiddin Yassin perlu dikekalkan sebagai perdana menteri, kata Presiden '\n",
      " 'Parti Amanah Negara (Amanah), Mohamad Sabu. Mohamad berkata, Muhyiddin '\n",
      " 'perlulah membantu Dr M. Muhyiddin untuk membangunkan negara termasuk '\n",
      " 'menambah baik ekonomi dan kewangan negara. \"Sebagai presiden parti, tugas '\n",
      " 'dan peranan utama Dr M adalah untuk membantu rakyat supaya negara boleh '\n",
      " 'maju. \"Rakyat akan merasa terkesan (daripada tindakan Muhyiddin),\" katanya '\n",
      " 'kepada Astro AWANI di sini pada Rabu. Mohamad berkata demikian ketika '\n",
      " 'mengulas kenyataan Pengerusi Pakatan Harapan (PH), Tun Dr Mahathir Mohamad '\n",
      " 'yang mencadangkan Dr Mahathir sebagai perdana menteri interim. Tambah '\n",
      " 'Mohamad, Perdana Menteri yang juga Pengerusi Pakatan Harapan juga mempunyai '\n",
      " 'tugas penting dalam menyelesaikan pelbagai masalah rakyat yang perlu diberi '\n",
      " 'perhatian. \"Saya fikir Dr Mahathir akan bantu (penubuhan kerajaan interim), '\n",
      " 'Dr Mahathir memang baik. \"Dr Mahathir bukan saja boleh memberi cadangan '\n",
      " 'supaya dia jadi perdana menteri interim, malah kita kena buat kerjalah bukan '\n",
      " 'sahaja kerja dan kerja dengan baik,\" katanya lagi. Dalam pada itu, Mohamad '\n",
      " 'berkata, Bersatu dan Pas juga bersetuju untuk tidak bertanding dalam Pilihan '\n",
      " 'Raya Umum (PRU) akan datang. \"Pas bersetuju untuk duduk semeja dengan PH '\n",
      " 'dalam PRU akan datang.']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'surat-khabar',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    top_k=50, \n",
    "    top_p=0.95,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ULASAN | Muhyiddin Yassin telah menyatakan bahawa Dr Mahathir Mohamad perlu '\n",
      " 'dikekalkan sebagai perdana menteri. Beliau juga telah menyatakan bahawa jika '\n",
      " 'Dr Mahathir kekal dengan pendiriannya, maka Dr Mahathir perlulah menolong '\n",
      " 'beliau. Muhyiddin juga telah menyatakan bahawa beliau perlu mengekalkan '\n",
      " 'jawatannya sebagai perdana menteri. Namun, jika Dr Mahathir masih kekal, '\n",
      " 'maka rakyat tidak mahu Muhyiddin menjadi Perdana Menteri. Ini kerana beliau '\n",
      " 'tidak mempunyai sokongan majoriti ahli parlimen untuk menubuhkan kerajaan '\n",
      " 'baharu. Muhyiddin telah menyatakan bahawa beliau perlu terus kekal sebagai '\n",
      " 'perdana menteri. Jika Mahathir kekal, maka rakyat akan memberi sokongan '\n",
      " 'kepada Muhyiddin. Jika Muhyiddin masih kekal, maka Dr Mahathir akan terus '\n",
      " 'memegang jawatan tersebut. Jika Dr Mahathir kekal, maka Dr Mahathir akan '\n",
      " 'terus kekal sebagai perdana menteri. Jika Muhyiddin kekal, Dr Mahathir akan '\n",
      " 'terus kekal sebagai perdana menteri. Jika Dr Mahathir kekal, maka rakyat '\n",
      " 'akan memberi sokongan kepada Muhyiddin. Jika Muhyiddin masih kekal, rakyat '\n",
      " 'akan memberi sokongan kepada Dr Mahathir sebagai perdana menteri. Jika '\n",
      " 'Muhyiddin kekal, Dr Mahathir akan kekal sebagai perdana menteri. Jika '\n",
      " 'Muhyiddin kekal, rakyat akan memberi sokongan kepada Dr Mahathir. Jika '\n",
      " 'Muhyiddin kekal, maka Dr Mahathir akan kekal sebagai perdana menteri. Jika '\n",
      " 'Dr Mahathir kekal, maka rakyat akan memberi sokongan kepada Dr Mahathir '\n",
      " 'untuk kekal sebagai perdana menteri. Adakah Muhyiddin akan kekal sebagai '\n",
      " 'perdana menteri jika Dr Mahathir kekal dengan pendiriannya? Jika Dr Mahathir '\n",
      " 'kekal, maka']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'surat-khabar',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    penalty_alpha=0.8, top_k=4,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "isi_penting = ['Neelofa tetap dengan keputusan untuk berkahwin akhir tahun ini',\n",
    "              'Long Tiger sanggup membantu Neelofa',\n",
    "              'Tiba-tiba Long Tiger bergaduh dengan Husein Zolkepli']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As above, we can give any isi penting even if it does not make any sense. Now we'll use the `generate` method and pass in a few of the vector arguments mentioned in a previous linked article by HuggingFace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['VOKALIS kumpulan Monologue, Neelofa tetap dengan keputusan untuk berkahwin '\n",
      " 'akhir tahun ini. Kata Neelofa, ia sebagai langkah persediaannya selepas '\n",
      " 'menghadapi saat getir seperti dilalui bekas kekasihnya, Wan Raja yang baru '\n",
      " 'mendirikan rumah tangga dengan Datuk Husein Zolkepli pada 1 April lalu. '\n",
      " 'Bagaimanapun kata Neelofa, setiap orang pasti mempunyai hal sendiri dalam '\n",
      " 'menentukan pasangan masing-masing. \"Saya sangat bersyukur dan bersyukur. Ini '\n",
      " 'masa yang terbaik untuk kami sekeluarga dan berharap agar perkahwinan ini '\n",
      " 'kekal bahagia hingga akhir hayat. \"Tetapi saya perlu ingat untuk kekal dalam '\n",
      " 'perkahwinan, setiap orang pasti ada hal sendiri. Dalam hal ini, apa yang '\n",
      " 'penting adalah perkahwinan. \"Saya juga mendoakan semoga perkahwinan ini '\n",
      " 'berkekalan hingga ke akhir hayat. Apa yang penting adalah perkahwinan ini '\n",
      " 'kekal bahagia hingga ke akhir hayat,\" ujarnya. Long Tiger sanggup bantu '\n",
      " 'Neelofa. Tiba-tiba Long Tiger bergaduh dengan Husein Zolkepli. \"Adik saya, '\n",
      " 'Neelofa pun sudah lama kenal saya. Saya harap dia akan terus bersama saya '\n",
      " 'selama-lamanya,\" ujarnya lagi. BACA: \"Kami sudah putus asa, saya dan Husein '\n",
      " 'belum putuskan lagi\" - Neelofa BACA: \\'Saya']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, do_sample=True,\n",
    "    max_length=256,\n",
    "    top_k=50, \n",
    "    top_p=0.9, ))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Previously we set the `top_k` parameter to `50`. A higher `top_k` value means the model considers more candidates, potentially leading to more diversity in the generated text but also increasing the computational cost.\n",
    "\n",
    "Now let's try lowering the parameter down and introduce the `penalty_alpha` argument to decrease randomness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['KUALA LUMPUR: Pelakon, pengacara, pengacara dan usahawan, Neelofa tetap '\n",
      " 'dengan keputusan untuk berkahwin akhir tahun ini, walaupun sudah bernikah '\n",
      " 'dengan bekas isteri, Datin Husein Zolkepli. Menerusi satu entri di '\n",
      " 'Instagram, Neelofa atau nama sebenarnya, Noor Neelofa Mohd Noor, 27, '\n",
      " 'berkata, dia tidak pernah gentar dengan Long Tiger. \"Tak pernah gentar '\n",
      " 'dengan Long Tiger. Long Tiger tak pernah gentar dengan saya. \"Saya tak '\n",
      " 'pernah gentar dengan Long Tiger. Long Tiger sanggup membantu saya. Tiba-tiba '\n",
      " 'Long Tiger bergaduh dengan Husein. \"Saya pun tak pernah nak bergaduh dengan '\n",
      " 'mereka. Saya pun tak pernah nak bergaduh dengan mereka,\" katanya. View this '\n",
      " 'post on Instagram #twtwtwt #twtwt #twtwt #twtwt #twtwt #twtwt #twtwt #twtwt '\n",
      " '#twt #twtwt #twt #twtwt #twtwt #tw']\n"
     ]
    }
   ],
   "source": [
    "pprint(model.generate(isi_penting, mode = 'surat-khabar',\n",
    "    do_sample=True, \n",
    "    max_length=256, \n",
    "    penalty_alpha=0.8, top_k=4,))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
